Contributing Data
CELLxGENE supports a rapidly growing single-cell data corpus because of generous contributions from researchers like you!
Submission and Publication Process
- Review the Data Eligibility criteria to ensure your data complies with these requirements
- Contact us with a description of the data that you'd like to contribute to confirm that we will accept your data
- Once confirmed, you send us files prepared according to the submission Requirements
- We upload to a private Collection where you can review
- The submission can be revised, as needed
- The data are made openly available when you are ready
Dataset Requirements
Data Eligibility
CELLxGENE supports most single-cell RNA-seq and ATAC-seq data, but a few types of data are not accepted at this time:
- drug screens
- cell lines
- species not on the supported list
- assays not on the supported list
- these additional assays are pending Census acceptance and will be accepted:
- expression measurements from multi-modal assays (e.g. 10x multiome, mCT-seq)
- unpaired ATAC-seq measurements
CELLxGENE continues to expand support for additional species and assays so please contact us if you are interested in submitting data not currently covered by the supported lists.
Scale Constraints
CELLxGENE Discover sets the maximum file size per dataset to 50 GB. Datasets with more than 4.6 million cells can still be submitted, but they will not be visualized in CELLxGENE Explorer.
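The two limits above can be checked before contacting the curation team. The following is a minimal sketch, not an official tool: the function name `check_scale` is hypothetical, and it assumes "50 GB" means decimal gigabytes (10^9 bytes), which the text does not specify.

```python
# Limits taken from the Scale Constraints paragraph.
MAX_FILE_BYTES = 50 * 10**9      # assumption: decimal gigabytes
MAX_EXPLORER_CELLS = 4_600_000   # above this, no CELLxGENE Explorer visualization


def check_scale(file_size_bytes, n_cells):
    """Return a list of messages describing which scale limits are exceeded."""
    problems = []
    if file_size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than the 50 GB per-dataset limit")
    if n_cells > MAX_EXPLORER_CELLS:
        problems.append(
            "more than 4.6 million cells: can be submitted, "
            "but will not be visualized in Explorer"
        )
    return problems
```

The file size could come from `os.path.getsize(path)` and the cell count from the number of rows in the dataset's observation table.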
Formatting Requirements
Include the following Collection metadata in your emails to describe your publication or study, all of which can be edited as titles, abstracts, etc. change:
- Collection information:
- Title
- Description
- Contact: a single name and email
- Publication/preprint DOI: optional
- URLs optional
- any links to the corresponding raw sequence data, protocols, and other related data or resources
- Consortia optional
- one or more of those listed here
The full schema is documented here but is summarized below. Each dataset needs the following information added to a single h5ad (AnnData 0.10) format file:
- Dataset-level metadata in uns:
- title
- batch_condition optional
- list of obs fields that define “batches” that a normalization or integration algorithm should be aware of
- default_embedding optional
- the obsm key associated with the embeddings you would like to be displayed in CELLxGENE Explorer by default
- Data in .X and raw.X:
- raw counts are required
- normalized counts are strongly recommended
- raw counts should be in raw.X if normalized counts are in .X
- if there is no normalized matrix, raw counts should be in .X
- Cell metadata in obs (ontology term IDs MUST be the most specific term available from the specified ontology):
- organism_ontology_term_id
- NCBITaxon term; see the schema for specific values
- donor_id
- free-text identifier that distinguishes the unique individual that data were derived from
- development_stage_ontology_term_id
- sex_ontology_term_id
- PATO:0000384 for male, PATO:0000383 for female, PATO:0001340 for hermaphrodite, or unknown if unavailable
- self_reported_ethnicity_ontology_term_id
- human: HANCESTRO
- multiple comma-separated terms may be used if more than one ethnicity is reported
- unknown if information unavailable
- other organisms: na
- disease_ontology_term_id
- MONDO or PATO:0000461 for normal: not necessarily any known disease of the donor, but rather any known disease thought to, or being tested to, have an impact on the measurement being taken
- tissue_type
- tissue, organoid, or cell culture
- tissue_ontology_term_id
- cell_type_ontology_term_id
- assay_ontology_term_id
- suspension_type
- cell, nucleus, or na
- Embeddings in obsm:
- one or more two-dimensional embeddings, prefixed with 'X_'
- Features in var & raw.var (if present):
- index is Ensembl gene ID
- preference is that genes have not been filtered in order to maximize future data integration efforts
- Additional standards for single-capture area Visium datasets (these largely align with scanpy’s model; this notebook may be helpful for curating from Space Ranger outputs):
- empty spots must be included
- 4992 total observations for 6.5 mm capture areas
- 14336 total observations for 11 mm capture areas
- obsm['spatial']
- obs['array_row']
- obs['array_col']
- obs['in_tissue']
- uns['spatial'][library_id]['images']['fullres'] preferred
- fullres image that is input to Space Ranger
- uns['spatial'][library_id]['images']['hires']
- hires image that is output from Space Ranger
- uns['spatial'][library_id]['scalefactors']['spot_diameter_fullres']
- uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']
- multiple-capture area Visium datasets are permitted if each capture area is also submitted individually
- the additional Visium standards do not apply to these
- Additional standards for single-puck Slide-seq datasets:
- obsm['spatial']
- multiple-puck Slide-seq datasets are permitted if each puck is also submitted individually
- obsm['spatial'] is not required for these
- Additional ATAC-seq standards
- Fragment files are required for unpaired ATAC-seq submissions, preferred for multi-modal submissions
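Several of the requirements listed above are mechanical enough to check before sending files. The sketch below is illustrative only and does not replace validation against the full schema. The function names are hypothetical, and the inputs are plain Python collections so the example runs without any single-cell libraries. It checks that the required obs fields exist, that the three fields with fixed vocabularies use the listed values, that at least one obsm key starts with `X_`, and that a Visium observation count matches one of the two stated totals. It does not verify ontology terms or that embeddings are two-dimensional.

```python
# Field names and allowed values are copied from the requirements list above.
REQUIRED_OBS = (
    "organism_ontology_term_id",
    "donor_id",
    "development_stage_ontology_term_id",
    "sex_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "disease_ontology_term_id",
    "tissue_type",
    "tissue_ontology_term_id",
    "cell_type_ontology_term_id",
    "assay_ontology_term_id",
    "suspension_type",
)
ALLOWED_VALUES = {
    "sex_ontology_term_id": {"PATO:0000384", "PATO:0000383", "PATO:0001340", "unknown"},
    "tissue_type": {"tissue", "organoid", "cell culture"},
    "suspension_type": {"cell", "nucleus", "na"},
}
VISIUM_OBS_COUNTS = {4992, 14336}  # 6.5 mm and 11 mm capture areas


def check_cell_metadata(obs_values, obsm_keys):
    """obs_values maps each obs column name to the values found in that column."""
    problems = []
    for field in REQUIRED_OBS:
        if field not in obs_values:
            problems.append(f"missing obs field: {field}")
    for field, allowed in ALLOWED_VALUES.items():
        unexpected = sorted(set(obs_values.get(field, [])) - allowed)
        if unexpected:
            problems.append(f"{field} has unexpected values: {unexpected}")
    if not any(key.startswith("X_") for key in obsm_keys):
        problems.append("obsm needs at least one embedding prefixed with 'X_'")
    return problems


def visium_obs_count_ok(n_obs):
    """True if the observation count matches a full single-capture-area grid."""
    return n_obs in VISIUM_OBS_COUNTS
```

With an AnnData object already loaded as `adata` (loading not shown), the inputs could be built as `{c: adata.obs[c].unique() for c in adata.obs.columns}` and `adata.obsm.keys()`.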
Data Submission Policy
I give CZI permission to display, distribute, and create derivative works (e.g. visualizations) of this data for purposes of offering CELLxGENE Discover, and I have the authority to give this permission. It is my responsibility to ensure that this data is not identifiable. In particular, I commit that I will remove any direct personal identifiers in the metadata portions of the data, and that CZI may further contact me if it believes more work is needed to de-identify it. If I choose to publish this data publicly on CELLxGENE Discover, I understand that (1) anyone will be able to access it subject to a CC-BY 4.0 license, meaning they can download, share, and use the data without restriction beyond providing attribution to the original data contributor(s) and (2) the Collection details (including Collection name, description, my name, and the contact information for the datasets in this Collection) will be made public on CELLxGENE Discover as well. I understand that I have the ability to delete the data that I have published from CELLxGENE Discover if I later choose to. This however will not undo any prior downloads or shares of such data.